Trade-offs between sensitivity and specificity

Recidivism, or the tendency of a person previously convicted of a crime to reoffend, can be critical to understand in order to effectively target limited time and resources. Using a model to predict recidivism might help more fairly distribute resources to prevent justice-involved people from falling back into prison. Said model might just as easily be manipulated to tighten the disproportionate hold the parole and probation system has on certain communities.

This duality is highlighted by the decisions that go into a logistic regression, or logit, model which uses potential predictors to come up with a binary (yes/no) outcome. One of the most important considerations in building a logit model are the the tradeoffs between sensitivity and specificity:

  • High sensitivity would indicate that the model has a low rate of false negatives (a person is predicted to not reoffend, bud did reoffend). It might be important to keep false negatives low if you have a limited budget or amount of time and want to target resources most heavily towards those who are likely to reoffend.
  • High specificity would indicate that the model low rate of false positives (a person is predicted to reoffend, but did not reoffend). A false positive is dangerous if someone is labeled as a “likely reoffender” and they have greater burdens put on them in a system of parole and probation that is already deeply burdensome and dehumanizing.
options(scipen=10000000)


library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(tidycensus)
library(gt)
library(gtExtras)
library(sf)
library(janitor)
library(tigris)
library(RColorBrewer)
library(stargazer)

ga_recid <- read.csv("https://raw.githubusercontent.com/bwbrad/PPA_HW4_Logit/main/Data/NIJ_s_Recidivism_Challenge_Full_Dataset_20240409.csv",
                     na.strings = c(""," ","NA")) %>% 
  clean_names("snake") %>% 
  mutate(recidivism_within_3years = case_when(
           recidivism_within_3years == "true" ~ 1,
           recidivism_within_3years == "false" ~ 0),
         recid_dv_char = case_when(
           recidivism_within_3years == 1 ~ "Reoffended in 3 years",
           recidivism_within_3years == 0 ~ "Did not reoffend in 3 years")) %>% 
  filter(!is.na(percent_days_employed))

recid_sum <- ga_recid %>% 
  group_by(recid_dv_char) %>% 
  summarize(people = n()) %>% 
  mutate(pct_people = people / sum(people))

Choosing Variables

This model is built using data from Georgia’s Office of Justice Programs on a subset of the state’s justice-involved population up through 2021. From a high level, 58% of people reoffend within 3 years. A greater share of reoffenses from the start means that whatever model we build will do a better job predicting those who do reoffend opposed to those who do not.

gt(recid_sum) %>% 
  fmt_number(columns = people, decimals = 0) %>% 
  fmt_percent(columns = pct_people, decimals = 1) %>% 
  cols_label(recid_dv_char = "Outcome",
             people = "Number",
             pct_people = "Percent") %>% 
  tab_header(title = md("**Overview of justice-involved population**")) %>% 
  gt_theme_nytimes() %>%  
  opt_row_striping()
Overview of justice-involved population
Outcome Number Percent
Did not reoffend in 3 years 10,469 41.3%
Reoffended in 3 years 14,904 58.7%

Below is a discussion on a variety of variables included in the Georgia dataset and their relationship to recidisim:

Education Level

People who earned a HS diploma or less are much more represented in those which reoffended in under 3 years, suggesting level of education plays an important role in recidivism rates.

recid_by_education <- ga_recid %>% 
  group_by(education_level, recid_dv_char) %>% 
  summarize(justice_inv_pop = n())

ggplot() +
  geom_col(data = recid_by_education,
           aes(x = education_level, y = justice_inv_pop, fill = education_level),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char) +
  theme_minimal() +
  labs(title = "Recidivism by eductional attainment",
       x = "Level of educational attainment",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Days Worked

The median percent of days employed is much higher for those who did not reoffend. What jumps out particularly is the substantially greater number of reoffenses for people who were employed zero days. This suggests that stability of income and employment may be critical in keeping formerly convicted people out of the justice system.

ggplot() +
  geom_histogram(data = ga_recid,
                 aes(x = percent_days_employed, fill = recid_dv_char)) +
  scale_fill_manual(values = c("#1f78b4", "#ff7f00"),
                    guide = "none") +
  geom_vline(data = ga_recid %>%
               filter(recid_dv_char == "Reoffended in 3 years"),
             aes(xintercept = median(percent_days_employed, na.rm = T),
                 colour = "Reoffended in 3 years"),
             lwd = 0.8,
             lty = "longdash",
             show.legend = TRUE) +
  geom_vline(data = ga_recid %>%
               filter(recid_dv_char == "Did not reoffend in 3 years"),
             aes(xintercept = median(percent_days_employed, na.rm = T),
                 colour = "Did not reoffend in 3 years"),
             lwd = 0.8,
             lty = "dotdash",
             show.legend = TRUE) +
  scale_color_manual(name = "Median",
                     values = c("Reoffended in 3 years" = "#ff7f00",
                                "Did not reoffend in 3 years" = "#1f78b4")) +
  scale_y_continuous(labels = scales::label_comma()) +
  scale_x_continuous(labels = scales::label_percent()) +
  facet_wrap(~ recid_dv_char, scales = "free") +
  theme_minimal() +
  labs(title = "Recidivism by % days employed",
       x = "Share of days employed",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Prison Offense

People who are charged with property offenses are more represented in those which reoffended in under 3 years, suggesting type of prison offense plays an important role in recidivism rates.

recid_by_prisonoffense <- ga_recid %>% 
  group_by(prison_offense, recid_dv_char) %>% 
  summarize(justice_inv_pop = n()) %>% 
  filter(prison_offense != "NA")

ggplot() +
  geom_col(data = recid_by_prisonoffense,
           aes(x = prison_offense, y = justice_inv_pop, fill = prison_offense),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char, scales = "free") +
  theme_minimal() +
  labs(title = "Recidivism by prison offense",
       x = "Prison Offense",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Prison Years

People who are sentenced to fewer years in prison, specifically fewer than 2 years, are more likely to re-offend in 3 years than those who are given longer prison sentences.

This might be hard to push, but the high occurrence of people with repeated shorter sentences could suggest that those with lesser offences (assuming lesser offenses = shorter prison sentence) should not be sentenced at all. There could be an alternative intervention for these offenders to keep them out of the prison system for repeated short stints.

recid_by_PrisonYrs <- ga_recid %>% 
  group_by(prison_years, recid_dv_char) %>% 
  summarize(justice_inv_pop = n())

recid_by_PrisonYrs$prison_years <- 
  factor(recid_by_PrisonYrs$prison_years,
         levels = c("Less than 1 year", "1-2 years",
                    "Greater than 2 to 3 years", "More than 3 years"))

x_labs_PrisonYrs <- c("Less than 1 year", "1-2 years",
                      "2-3 years", "More than 3 years")

ggplot() +
  geom_col(data = recid_by_PrisonYrs,
           aes(x = prison_years, y = justice_inv_pop, fill = prison_years),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_x_discrete(labels = x_labs_PrisonYrs) +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char, scales = "free") +
  theme_minimal() +
  labs(title = "Recidivism by prison years",
       x = "Prison years",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Age at Release

People who are younger at the time of their release are more likely to re-offend within three years. This suggests there needs to be more assistance to reintegrate younger people back into society after their prison release. Without proper social services and sustained behavioral intervention, younger people may be more inclined to fall back into previous behaviors.

recid_by_Age <- ga_recid %>% 
  group_by(age_at_release, recid_dv_char) %>% 
  summarize(justice_inv_pop = n())

ggplot() +
  geom_col(data = recid_by_Age,
           aes(x = age_at_release, y = justice_inv_pop, fill = age_at_release),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char, scales = "free") +
  theme_minimal() +
  labs(title = "Recidivism by age at release",
       x = "Age at Release",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Program Attendences

Unfortunately, most prisoners do not participate in programming, so it is difficult to determine the effect program attendance has on recidivism. There is some evidence that people who do not attend any programs during their original sentence are more likely to re-offend in three years. This may support the efforts of prison programming to prevent future offenses. Nevertheless, since the metric is statistically significantly associated with -.2 log odds for each program attended, we included it in our report.

recid_by_Program <- ga_recid %>% 
  group_by(program_attendances, recid_dv_char) %>% 
  summarize(justice_inv_pop = n())

recid_by_Program$program_attendances <- 
  factor(recid_by_Program$program_attendances,
         levels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                    "10", "10 or more"))

ggplot() +
  geom_col(data = recid_by_Program,
           aes(x = program_attendances, y = justice_inv_pop, fill = program_attendances),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char) +
  theme_minimal() +
  labs(title = "Recidivism by program attendances",
       x = "Program attendances",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Crafting and assessing a model for recidivism

Describing the best-performing model

ga_recid <- ga_recid %>% 
  mutate(prison_years_over3 = case_when(
           prison_years == "More than 3 years" ~ 1,
           TRUE ~ 0),
         programs_over10 = case_when(
           program_attendances == "10 or more" ~ 1,
           TRUE ~ 0),
         programs_none = case_when(
           program_attendances == "0" ~ 1,
           TRUE ~ 0),
         older_than37 = case_when(
           age_at_release %in% c("38-42", "43-47", "48 or older") ~ 1,
           age_at_release %in% c("18-22", "23-27", "23-27", 
                                 "28-32", "33-37") ~ 0),
         some_college = case_when(
           education_level == "At least some college" ~ 1,
           education_level %in% c("High School Diploma",
                                  "Less than HS diploma") ~ 0),
         high_supervision = case_when(
           supervision_level_first == "High" ~ 1,
           TRUE ~ 0),
         offense_property = case_when(
           prison_offense == "Property" ~ 1,
           TRUE ~ 0),
         offense_violent_sex = case_when(
           prison_offense == "Violent/Sex" ~ 1,
           TRUE ~ 0),
         one_dependent = case_when(
           dependents == "1" ~ 1,
           TRUE ~ 0),
    )

set.seed(3456)

train_index <- createDataPartition(ga_recid$recidivism_within_3years, p = .50,
                                  list = FALSE,
                                  times = 1)

recid_train <- ga_recid[ train_index,]
recid_test  <- ga_recid[-train_index,]

unique(ga_recid$supervision_level_first)

We made three different models, and settled on one with 12 different predictors of recidivism related to individual characteristics (education level, number of dependents), and criminal history (type of offense, number of years in prison). We found that our model predicted more effectively when we:

  • Created dummy variables to simplify categorical predictors
  • Removed measures for drug testing, failure to report violations, violations for movement without permission, and low program attendance.

Every predictor in our final model is significant at a 95% confidence interval. Our model also has an 8.4% decrease between estimates of null deviance and residual deviance, indicating that our variables increase the model’s effectiveness over the intercept alone.

See a table of our model’s predictions below. Our model includes values that both increase and decrease the log odds of recidivism, including percent of days employed (1.164 decrease in log odds of recidivism per percent) and dummy variable for property offense (.317 increase in log odds of recidivism for those who committed property offenses).

Among those most at risk of recidivism include those who have only been in prison a short time, those who committed property offenses, and those who had prior parole violations.

recid_model_3 <- glm(recidivism_within_3years ~ some_college + offense_property +
                       offense_violent_sex + percent_days_employed +
                       prison_years_over3 + older_than37 + programs_over10 +
                       high_supervision + one_dependent +
                       prior_arrest_episodes_dv_charges + prior_revocations_parole +
                       prior_revocations_probation,
                     data = ga_recid,
                     family="binomial" (link="logit")) 
stargazer(recid_model_3, type = "html")
Dependent variable:
recidivism_within_3years
some_college -0.396***
(0.035)
offense_property 0.317***
(0.030)
offense_violent_sex -0.588***
(0.084)
percent_days_employed -1.164***
(0.032)
prison_years_over3 -0.269***
(0.034)
older_than37 -0.570***
(0.029)
programs_over10 -0.298***
(0.042)
high_supervision 0.163***
(0.032)
one_dependent 0.094***
(0.034)
prior_arrest_episodes_dv_chargestrue 0.385***
(0.037)
prior_revocations_paroletrue 0.527***
(0.048)
prior_revocations_probationtrue 0.154***
(0.039)
Constant 1.033***
(0.031)
Observations 25,373
Log Likelihood -15,743.140
Akaike Inf. Crit. 31,512.270
Note: p<0.1; p<0.05; p<0.01

Model diagnostics

ROC curve

Our ROC curve shows that, while our model is not ideal, it still predicts better than a random prediction. It has an area under the curve of .7, which indicates that is a fair, but not good predictor of recidivism. One problem with the curve is that it is a gentle arc, with no clear change in slope that would help us know where to change our predictions for recidivism.

auc(recid_test$outcome, recid_test$probs)

ggplot(recid_test, aes(d = as.numeric(outcome), m = probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#1f78b4") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve: Recidivism Model") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)))

Confusion Matrix

recid_test <- recid_test %>% 
  mutate(outcome = as.factor(recidivism_within_3years),
         probs = predict(recid_model_3, recid_test, type= "response"),
         pred_outcome = as.factor(case_when(probs >= 0.5 ~ 1,
                                            probs < 0.5 ~ 0)))

cm <- confusionMatrix(recid_test$pred_outcome, recid_test$outcome,
                      positive = "1")

cm_table <- cm$table
cm_df <- as.data.frame(cm_table)

rownames(cm_df) <- c("True Negative", "False Positive", 
                     "False Negative", "True Positive")

The confusion matrix below highlights our model’s relative sensitivity and specificity. It shows a True Positive value of 6,015, meaning 6,015 entries are correctly identified as positive. In this case, that means our model correctly predicts 6,015 people who experience recidivism within three years. The False Positive value is 2,869, meaning 2,869 people are incorrectly predicted to experience recidivism within three years. The False Negative value of 1,417 represents the number of people who will experience recidivism in the next three years that our model failed to predict. The True Negative value of 2,385 represents the number of people our model correctly predicted to stay out of the prison system within three years.

Our model has an accuracy of just 66%, however the sensitivity, or recall, of our model is 0.81, meaning it correctly identifies 81% of people who will experience recidivism within three years. The specificity of our model is 0.45, meaning the model correctly identifies 45% of people who will not reoffend in three years.

gt(cm_df %>% 
     rownames_to_column()) %>% 
  fmt_number(columns = Freq, decimals = 0) %>% 
  tab_header(title = md("**Confusion Matrix**")) %>%
  opt_align_table_header(align = "left") %>%
  cols_label(Freq = "Frequency") %>% 
  gt_theme_nytimes() %>% 
  opt_row_striping()
Confusion Matrix
Prediction Reference Frequency
True Negative 0 0 2,385
False Positive 1 0 2,869
False Negative 0 1 1,417
True Positive 1 1 6,015

Prediction error and disparities by race & gender

Analysis of the interactions between race and gender suggest that the model underpredicts for white men and women compared to Black men and women. The rate of false positives for Black men is 22.8% vs. 19.7% for white men. Similarly, the false positive rate for Black women is 33.3% vs. 28.8% for white women. Recidivism is overall are less accurately predicted for women.

recid_test <- recid_test %>% 
  mutate(cm = case_when((outcome == 1) & (pred_outcome == 1) ~ "True Positive",
                        (outcome == 0) & (pred_outcome == 1) ~ "False Positive",
                        (outcome == 0) & (pred_outcome == 0) ~ "True Negative",
                        (outcome == 1) & (pred_outcome == 0) ~ "False Negative"),
         gender = case_when(gender == "F" ~ "Women",
                            gender == "M" ~ "Men"),
         race = case_when(race == "BLACK" ~ "Black",
                          race == "WHITE" ~ "White"))

error_by_demographics <- recid_test %>% 
  group_by(gender, race, cm) %>% 
  tally() %>% 
  mutate(pct_outcome = n / sum(n))
gt(error_by_demographics) %>% 
  fmt_number(columns = n, decimals = 0) %>% 
  fmt_percent(columns = pct_outcome, decimals = 1) %>% 
  cols_label(cm = "Prediction Category",
             n = "Count",
             pct_outcome = "% Subgroup") %>% 
  tab_header(title = md("**Prediction disparities by race & gender**")) %>%
  opt_align_table_header(align = "left") %>% 
  tab_style(location = cells_row_groups(),
            style = cell_text(weight = "bold")) %>% 
  gt_theme_nytimes() %>% 
  opt_row_striping()
Prediction disparities by race & gender
Prediction Category Count % Subgroup
Men - Black
False Negative 693 10.4%
False Positive 1,526 22.8%
True Negative 1,060 15.8%
True Positive 3,412 51.0%
Men - White
False Negative 576 13.0%
False Positive 874 19.7%
True Negative 955 21.5%
True Positive 2,041 45.9%
Women - Black
False Negative 44 8.7%
False Positive 168 33.3%
True Negative 115 22.8%
True Positive 178 35.2%
Women - White
False Negative 104 10.0%
False Positive 301 28.8%
True Negative 255 24.4%
True Positive 384 36.8%

Estimating a “cost” of recidivism

It is difficult to evaluate the economic impacts of recidivism, as the negative inverse of recidivism (keeping someone in prison), is also a very costly enterprise.

To attempt to estimate the financial implication of our model, we utilize the following assumptions:

  • The cost of imprisonment in Pennsylvania costs an average of $150,000 for three years
  • The value of a person living and working outside of prison is $150,000 for three years. This can be a positive value for someone justly released from prison or a negative value for someone unjustly kept in prison.
  • The cost of a criminal trial due to recidivism is $15,000
  • Recidivism occurs at an average of 1.5 years after release.

Together, these assumptions led us to the following cost estimates:

  • True Negative (no recidivism, no prison): +$300,000
  • True Positive (recidivism, three years of prison): -$100,000
  • False Negative (recidivism, 1.5 years of jail): -$90,000
  • False Positive (no recidivism, 1.5 years of jail) -$300,000
recid_test <- recid_test %>%
  filter(!is.na(pred_outcome))

cost_benefit_table <-
   recid_test %>%
      count(pred_outcome, outcome) %>%
      summarize(True_Negative = sum(n[pred_outcome==0 & outcome==0]),
                True_Positive = sum(n[pred_outcome==1 & outcome==1]),
                False_Negative = sum(n[pred_outcome==0 & outcome==1]),
                False_Positive = sum(n[pred_outcome==1 & outcome==0])) %>%
       gather(Variable, Count) %>% 
       mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((.35 - .1) * Count),
               ifelse(Variable == "False_Negative", (-0.35) * Count,
               ifelse(Variable == "False_Positive", (-0.1) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted no recidivism",
              "We correctly predicted recidivism",
              "We predicted no recidivism and person reoffended",
              "We predicted recidivism and person did not reoffend")))
#True negative (+$300,000)
#Person reintegrated into society, 50k  /year on saved prison costs + 50k/year on benefit to society


#True positive (-$100,000)
#Cost of imprisonment per year in PA is $50,000/yr, saved cost of committed crime is $50k
#https://www.vera.org/publications/what-jails-cost-cities/philadelphia-pa

#False Negative(-$90,000)
#Estimated cost of a criminal trial at $15,000 + average of another 1.5 years in jail, $75,000
#Cost of person committing crime $50,000

#False Positive (-$300,000)
#Cost of imprisonment plus 50k / year opportunity cost of person not being in society


alt_cost_benefit_table <-
   recid_test %>%
      count(pred_outcome, outcome) %>%
      summarize(True_Negative = sum(n[pred_outcome==0 & outcome==0]),
                True_Positive = sum(n[pred_outcome==1 & outcome==1]),
                False_Negative = sum(n[pred_outcome==0 & outcome==1]),
                False_Positive = sum(n[pred_outcome==1 & outcome==0])) %>%
       gather(Variable, Count) %>% 
       mutate(Cost =
               ifelse(Variable == "True_Negative", Count * 300000,
               ifelse(Variable == "True_Positive",((-100000) * Count),
               ifelse(Variable == "False_Negative", (-90000) * Count,
               ifelse(Variable == "False_Positive", (-300000) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted no recidivism",
              "We correctly predicted recidivism",
              "We predicted no recidivism and person reoffended",
              "We predicted recidivism and person did not reoffend"))) %>% 
  mutate(Cost = Cost/1000000)


alt_cost_benefit_table <- alt_cost_benefit_table %>% 
  mutate(Variable = recode(Variable, 
                           "True_Negative" = "True Negative", 
                           "True_Positive" = "True Positive", 
                           "False_Negative" = "False Negative", 
                           "False_Positive" = "False Positive"))

gt(alt_cost_benefit_table) %>% 
  fmt_number(columns = Count, decimals = 0) %>% 
  fmt_currency(columns = Cost, decimals = 0) %>% 
  cols_label(Variable = "Outcome") %>% 
  tab_header(title = md("**Cost-benefit analysis**")) %>% 
  opt_align_table_header(align = "left") %>% 
  cols_label(Cost = "Cost ($Million)") %>% 
  gt_theme_nytimes() %>% 
  opt_row_striping()
Cost-benefit analysis
Outcome Count Cost ($Million) Description
True Negative 2,385 $716 We correctly predicted no recidivism
True Positive 6,015 −$602 We correctly predicted recidivism
False Negative 1,417 −$128 We predicted no recidivism and person reoffended
False Positive 2,869 −$861 We predicted recidivism and person did not reoffend

In total, our model shows the government losing $875 million dollars a year. While our model may save as much as $716 million dollars for releasing prisoners early, the real lesson from this exercise is that prison is extremely expensive, and should be avoided whenever possible. However, as rates of recidivism show, prison itself is not a deterrent of further time in prison. Punishment for criminal activity needs to include non-carceral rehabilitation that will save the government money in the long run.

Conclusion

Overall, we determine that social determinants of criminality, including level of education, employment, and race have a significant impact on individuals’ likelihood of recidivism. At the same time, though our model can help the government release prisoners early who have low likelihoods of recidivism, it does nothing to resolve the overarching problem of high recidivism rates.

For further research, we recommend pairing the underlying social and economic causes of the factors that most reliably predict recidivism with appropriate ameliorative programming. These could include:

  • Where “some college” reduces recidivism, increase access to higher education
  • Where “percent of days employed” decreases recidivism, improve professional opportunities after release
  • Where property offenses increase recidivism, offer post-incarceration services targeted at those who committed property offenses

While prison may be an important tool for managing public safety, the best economic and social tools for improving outcomes in the criminal justice system are keeping people out of prison in the first place, and helping them stay out when they’re released.

Use of our model

While it is important to help people stay out of prison, our model can still be useful in and of itself. Especially given the low accuracy of our model, it should be used as a tool for matching prisoners with support services, rather than as a decider of whether or not someone should be released or kept in prison. For example, if a prisoner has a high log-likelihood of recidivism, instead of keeping them in prison, our model could identify them as a target for specific support services once released.The model can also be refined with future observations of trends in recidivism, with those added support systems.

Appendix

See graphics below that were not included in our final modeling, but offer extra evidence of the interaction between personal characteristics and recidivism. We also include the distribution of our model’s predictions, based on actual results to show the difficulty of predicting at the margins.

Number of residence changes

Residence changes does not really suggest housing instability is related to recidivism. Nor is gang affiliation (not listed below).

recid_by_moving <- ga_recid %>% 
  group_by(residence_changes, recid_dv_char) %>% 
  summarize(justice_inv_pop = n())

ggplot() +
  geom_col(data = recid_by_moving,
           aes(x = residence_changes, y = justice_inv_pop, fill = residence_changes),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char) +
  theme_minimal() +
  labs(title = "Recidivism by number of residence changes",
       x = "Number of residence changes",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Prior gun charges

Episodes of prior gun charges doesn’t predict recidivism in 3 years either.

recid_by_PriorGun <- ga_recid %>% 
  group_by(prior_arrest_episodes_gun_charges, recid_dv_char) %>% 
  summarize(justice_inv_pop = n())

ggplot() +
  geom_col(data = recid_by_PriorGun,
           aes(x = prior_arrest_episodes_gun_charges, y = justice_inv_pop, fill = prior_arrest_episodes_gun_charges),
           position = "dodge",
           show.legend = FALSE) +
  scale_fill_brewer(type = "qual",
                    palette = "Paired") +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_wrap(~ recid_dv_char) +
  theme_minimal() +
  labs(title = "Recidivism by episodes of prior gun charges",
       x = "Episodes of prior gun charges",
       y = "Justice-involved pop.") +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 6)),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Distribution of predictions

palette2 <- c("#ff7f00","#1f78b4")

ggplot(subset(recid_test, pred_outcome != "NA"), aes(x = probs, fill = as.factor(outcome))) + 
  geom_density(alpha = .75,
               color = NA) +
  scale_fill_manual(name = "Outcome",
                    labels = c("Did not reoffend (0)", "Reoffended (1)"),
                    values = palette2) +
  labs(x = "Recidivism", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", margin = margin(b = 6)),
        strip.text.x = element_text(size = 18))